The Northern Ireland local group of the RSS held an online meeting using MS Teams on Tuesday, 20 April 2021, at 2pm on 'Mathematical methods to accelerate numerical algorithms applied to genomic data'.
The speaker was Dr Sofya Titarenko, senior lecturer in mathematics at the University of Huddersfield, who presented work done in collaboration with Dr Valeriy Titarenko from the University of Manchester.
Processing and analysing genomic data are key procedures in many applications, such as the development of targeted medicines or investigation of triggers for diseases, such as cancer, Alzheimer or Covid. Short/long read alignment is often used as a preliminary step before application of data analytics/machine learning tools.
Many popular sequence alignment algorithms (such as BWA or Bowtie) for next-generation sequencing were developed more than a decade ago. The algorithms are based on Borrows-Wheeler transform, originally used for data compression, but it demonstrated its efficiency for sequence matching.
Components of modern computers make it possible to try new approaches in sequence alignment. The most important areas of progress are related to the size of random-access memory and read/write speeds for storage devices available, even for budget computers.
In this work we demonstrate a new approach to sequence alignment. The algorithm work with the different format of data storage which allows us to apply different algorithms for seeding and comparing sequences. The format of data storage is chosen in a way to take advantage of fast SIMD (Single Instruction Multiple Data) instructions. The seeding algorithm designed to reduce the number of candidate positions, while allowing a certain number of mismatches in a candidate place.
Human reference genome (GRCh38.p13) contains 3.2 billion symbols (letters A, C, G, T or N). For efficient comparison of sequences, we can use the following technique:
- Divide reference genome in chunks of 32 symbols;
- Code each chunk with four 32-bit numbers (a bit is 1 if a given symbol is present, otherwise 0). We use 72% more memory but faster comparison of two 32-symbol strings.
- For each position find a "signature" (a number based on neighboring symbols).
- Form a list of pairs ("position", "signature") and sort it by "signature" values.
"Signatures" are based on spaced seeds when some symbols are ignored, so sequences with mismatches may have the same "signatures".
We represent spaced seeds in the form of binary vectors ε ∈ S(n, k, m, w), where m n is length of a seed, where m ≤ n is length of a seed, n=32,40,48,56,64 is a length of a read, k is length of a read, k is length of a mismatches and w is the spaced seed's weight. We assume that:
- ∀ε ∈ S(n, k, m, w), ∀p > n : ε ∈ S(p, k, m, w)
- ∀ε ∈ S(n, k, m, w), ∀r < k : ε ∈ S(n, r, m, w)
- Seeds have maximum weight from all possible combinations of patterns
- Seeds have a periodic structure.
Using the properties listed above, we generate spaced seeds in such a way to be able to align sequences with the largest possible number of mismatches. On the other hand, by increasing the seeds' weight we reduce the number of candidate positions within a reference genome, thus improving performance of an alignment procedure. This representation and fast SIMD instructions allow us to optimise searching procedure.
This was an interesting and detailed talk on developing computer algorithms designed to align with current hardware in order to improve performance. The talk was well received by the online audience of 15 participants who were not that familiar with this area. Sofya was asked about the computer environment, the language being used and the improvements in speed expected. The audience thanked Sofya for a stimulating talk and showed its appreciation in the usual way.
Author
Dr Magda Bucholc is a lecturer in data analytics at the Intelligent Systems Research Centre at Ulster University.